Base Notebook

Here are some notes on decisions I have made at different stages in this notebook (let us know in the group chat if you want to make any changes to these steps or have any feedback/concerns):

Target Class

Missing Values

Feature Selection

Transforming and Scaling

Models

HyperParameters

Metrics

Exploratory Data Analysis

We can see that the first trip for both cars is almost twice as long as the second trip.

Let's explore the individual datasets before joining them. First, let's take a look at the structure of the datasets.

Attributes

14 Numeric Attributes:

AltitudeVariation - altitude change calculated over 10 seconds;

VehicleSpeedInstantaneous - current speed value;

VehicleSpeedAverage - average speed in the last 60 seconds;

VehicleSpeedVariance - speed variance in the last 60 seconds;

VehicleSpeedVariation - speed variation for every second of detection;

LongitudinalAcceleration - measured by the smartphone accelerometer and pre-processed with a low-pass filter;

EngineLoad - expressed as a percentage;

EngineCoolantTemperature - in degrees Celsius;

ManifoldAirPressure - (MAP), a parameter the internal combustion engine uses to compute the optimal air/fuel ratio;

EngineRPM - Revolutions per Minute of the engine;

MassAirFlow - (MAF) Rate measured in g/s, used by the engine to set fuel delivery and spark timing;

IntakeAirTemperature - (IAT) at the engine entrance;

VerticalAcceleration - measured by the smartphone accelerometer and pre-processed with a low-pass filter;

AverageFuelConsumption - calculated as needed liters per 100 km.

3 Categorical Attributes

roadSurface - 3 classes: SmoothCondition, FullOfHolesCondition, UnevenCondition;

traffic - 3 classes: LowCongestionCondition, NormalCongestionCondition, HighCongestionCondition;

drivingStyle - 2 classes: EvenPaceStyle, AggressiveStyle.

Checking for missing values.

There are only a few missing values in the Peugeot datasets; we can deal with these at the preprocessing stage.
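A quick way to count the missing values per column is `DataFrame.isna().sum()`. A minimal sketch (the column names are from the attribute list above; the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for one of the trip datasets, with a couple of NaNs
df = pd.DataFrame({
    "VehicleSpeedInstantaneous": [10.0, np.nan, 30.0, 55.0],
    "EngineRPM": [900.0, 1500.0, np.nan, 2100.0],
})

# Missing-value count per column
missing = df.isna().sum()
print(missing)
```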

Correlation between numerical variables

Let's look at correlation between the numerical variables in each dataset.
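The pairwise Pearson correlations come straight from `DataFrame.corr()`; a heatmap (e.g. seaborn's `sns.heatmap(corr, annot=True)`) makes them easier to scan. A sketch on synthetic stand-in data, where EngineRPM is deliberately constructed to track speed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
speed = rng.uniform(0, 120, 200)
df = pd.DataFrame({
    "VehicleSpeedInstantaneous": speed,
    "EngineRPM": speed * 30 + rng.normal(0, 100, 200),  # tracks speed
    "IntakeAirTemperature": rng.uniform(10, 40, 200),   # unrelated noise
})

# Pearson correlation matrix of the numeric columns
corr = df.corr()
print(corr.round(2))
```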

Some initial observations:

Distribution of numerical variables

Now let's have a look at the distribution of the numerical variables in each dataset.

For each dataset, the graphs for VehicleSpeedInstantaneous, VehicleSpeedVariance, ManifoldAbsolutePressure, EngineRPM and MassAirFlow are strongly right-skewed, while EngineCoolantTemperature has a long left tail in each of the datasets. We may need to apply some kind of transformation to these variables to try to create more normal distributions.

We can also see there is a clear difference in the shape of the distribution for VerticalAcceleration and LongitudinalAcceleration between the opel_02 dataset and the others. Let's take a closer look.

We can see that there are two distinct peaks for both variables at the same time in the opel_02 journey (could be caused by the smartphone accelerometer malfunctioning at these times?) The higher values for each of these variables in the opel_02 dataset could cause the overall dataset to have some outliers, so we should keep an eye on these variables later.

Comparing the distribution of categorical variables in each dataset.

In general, the distribution of driving styles in each dataset seems to be quite similar, although the Peugeot datasets have about 15% less aggressive style driving compared to the Opel datasets.

Now let's look at the roadSurface.

From the statistics and charts above we see there is a big imbalance between the datasets for this target class. While the Opel trips were driven on mostly smooth roads, the two Peugeot trips were on considerably worse road surfaces, especially the second Peugeot trip where only 6% of the roads were in a smooth condition. We would need to accommodate this imbalance when splitting the final dataset into train and test sets if we choose this as our target class for the model.

The last category class we will compare is traffic.

While the traffic congestion conditions for the first three datasets are quite similar, there was clearly much more congestion during the last trip. Let's visualize the vehicle speed during the different trips to see if the high traffic for trip 4 is noticeable.

From these charts we can see that when there is mainly low traffic congestion the vehicle can go at greater speeds, but during trip 4 there was too much medium to high traffic congestion to allow the vehicle to reach higher speeds.

So we can make an assumption that once the speed goes over a certain threshold, there is a higher probability that there is low traffic.

Let's see if it's the same for road surface, do higher speeds indicate that the road surface was probably in a better condition?

In the first two trips, high speeds seem to go hand-in-hand with smooth roads. In the third trip (peugeot_01), however, speeds were slightly higher when the road was full of holes than when it was merely uneven, and in the fourth trip (peugeot_02) higher speeds were reached on uneven roads and roads full of holes than on smooth roads. So speed may not be quite as reliable a predictor for road surface condition as it is for traffic congestion.

Now let's use one-hot encoding to turn the categories into numerical values and further explore how they are correlated to the other variables using correlation matrices and heatmaps.
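One-hot encoding can be done with `pandas.get_dummies`, after which the new dummy columns can go straight into a correlation matrix. A minimal sketch with made-up rows (class labels taken from the attribute list above):

```python
import pandas as pd

df = pd.DataFrame({
    "roadSurface": ["SmoothCondition", "UnevenCondition",
                    "FullOfHolesCondition", "SmoothCondition"],
    "VehicleSpeedAverage": [80.0, 45.0, 30.0, 90.0],
})

# One column per roadSurface class, encoded as 0/1
encoded = pd.get_dummies(df, columns=["roadSurface"], dtype=int)
print(encoded.columns.tolist())

# Correlation of each dummy column with average speed
corr = encoded.corr()["VehicleSpeedAverage"]
print(corr.round(2))
```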

Comparing the correlation with roadSurface classes and the rest of the dataset.

Comparing the correlation with traffic classes and the rest of the dataset.

Preprocessing

Choosing a target class

The classes for roadSurface seem to be slightly more balanced than the distribution of classes for traffic, so let's choose roadSurface as our target.

Also the minority class for roadSurface ('FullOfHolesCondition') might be a bit easier to predict than traffic's smallest class ('Medium'), as it's more correlated with the other variables.

Because there is an imbalance of the classes in roadSurface, we will need to use a stratified train/test split to ensure that the y sets have a similar distribution of classes.
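Passing the target to the `stratify` parameter of `train_test_split` preserves the class proportions in both splits. A sketch on a synthetic, deliberately imbalanced target (60/30/10):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
y = pd.Series(["SmoothCondition"] * 60 + ["UnevenCondition"] * 30
              + ["FullOfHolesCondition"] * 10)
X = pd.DataFrame({"VehicleSpeedAverage": rng.uniform(0, 120, len(y))})

# stratify=y keeps the 60/30/10 split in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(y_test.value_counts(normalize=True).round(2))
```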

Splitting the Data into Training/Testing

There aren't that many missing values, so they are most likely missing at random, although there seem to be some runs of successive rows with missing values in one or two of the datasets. We could easily drop all rows with missing values as there are so few of them, but in this case let's fill them with median values in the preprocessing pipeline (the median is probably more suitable than the mean if the data is a bit skewed).
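Median imputation is a one-liner with scikit-learn's `SimpleImputer`; a sketch on a small made-up array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two numeric columns with one NaN each
X = np.array([[10.0,  900.0],
              [np.nan, 1500.0],
              [30.0,  np.nan],
              [50.0,  2100.0]])

# Each NaN is replaced with the median of its column
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)
print(X_filled)
```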

Transforming the data

Let's look again at the distributions of the numerical values to check for skewness, and then let's try out different transformations and compare results to find the most suitable transformation for different columns.
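One way to compare transformations is to measure skewness before and after. A sketch with a synthetic right-skewed column as a stand-in for something like EngineRPM, using the Yeo-Johnson `PowerTransformer`:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.exponential(scale=20.0, size=1000)  # strongly right-skewed

# Yeo-Johnson handles zero/negative values (unlike Box-Cox)
pt = PowerTransformer(method="yeo-johnson")
x_t = pt.fit_transform(x.reshape(-1, 1)).ravel()

print(f"skew before: {skew(x):.2f}, after: {skew(x_t):.2f}")
```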

We can see that AltitudeVariation, VehicleSpeedVariance, VehicleSpeedVariation, LongitudinalAcceleration and VerticalAcceleration all have a good deal of outliers. Now let's visualize the distributions of each feature.

PowerTransformer seems to be the most effective transform for reducing skewness across all columns (although it does increase kurtosis for VehicleSpeedInstantaneous, VehicleSpeedVariation, EngineCoolantTemperature, IntakeAirTemperature and FuelConsumptionAverage).

The PowerTransformer has removed most of the outliers in the VehicleSpeedVariance and LongitudinalAcceleration columns, but a good deal of the outliers in the AltitudeVariation, VehicleSpeedVariation and VerticalAcceleration columns remain; we should keep this in mind when selecting which features to train the models on.

Preprocessing Pipeline
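Combining the two steps above, a minimal sketch of the pipeline: median imputation followed by the power transform, wrapped in a scikit-learn `Pipeline` so it can be fit on the training set only and reapplied to the test set:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),      # fill NaNs
    ("power", PowerTransformer(method="yeo-johnson")),  # reduce skew, standardize
])

# Tiny stand-in column with a missing value and a large outlier
X = np.array([[10.0], [np.nan], [30.0], [200.0]])
X_prep = preprocess.fit_transform(X)
print(X_prep.round(2))
```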

Feature Selection

Let's use a random forest classifier to find the importance of each feature and from there we can decide which features to train our models with.
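The impurity-based importances are exposed as `feature_importances_` on a fitted forest. A sketch on synthetic data where one feature is informative and one is noise:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
y = rng.integers(0, 3, 300)  # three classes, like roadSurface
X = pd.DataFrame({
    "VehicleSpeedAverage": y * 20 + rng.normal(0, 5, 300),  # informative
    "IntakeAirTemperature": rng.uniform(10, 40, 300),        # pure noise
})

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).round(3))
```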

Let's remove the five least important features (DrivingStyle, VehicleSpeedVariation, AltitudeVariation, EngineLoad and MassAirFlow) as irrelevant features can lead to overfitting and selecting the most relevant features can increase the accuracy of our models and reduce the computational time.

Multi-Classifier Models

Support Vector Machine (SVM)

We will come back to Random Forest classifiers soon, but first let's try out a Support Vector Machine (SVM) model using the Support Vector Classification (SVC) function of scikit-learn. We will first create a grid of hyperparameters for four different types of SVC kernels and then use GridSearchCV to find the best performing kernel and hyperparameters.
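A sketch of the grid-search setup, using the iris dataset as a small stand-in and an illustrative (not exhaustive) grid over the four kernels:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One parameter dict per kernel; grids here are illustrative
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
    {"kernel": ["poly"], "C": [1], "degree": [2, 3]},
    {"kernel": ["sigmoid"], "C": [1]},
]

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```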

The lines below are just me trying to get the ROC AUC scores for this model (I've manually created the model again using the best estimators, but I think I have to set probability to True, and I'm not sure if this changes the model too, or just adds another output). Still haven't figured this out, would anybody else like to try?
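One possible answer to the question above: `probability=True` makes SVC fit an extra Platt-scaling calibration (via internal cross-validation) on top of the same decision function, so `predict()` is unchanged, though `predict_proba()` can occasionally disagree with it. For the multi-class case, `roc_auc_score` then needs `multi_class="ovr"` (or `"ovo"`). A sketch with iris as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# probability=True adds calibrated predict_proba; decision function is the same
svc = SVC(kernel="rbf", C=1, probability=True, random_state=0).fit(X_tr, y_tr)
proba = svc.predict_proba(X_te)

# One-vs-rest averaging for multi-class ROC AUC
auc = roc_auc_score(y_te, proba, multi_class="ovr")
print(round(auc, 3))
```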

https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

https://towardsdatascience.com/roc-curve-explained-using-a-covid-19-hypothetical-example-binary-multi-class-classification-bab188ea869c

Decision Trees / Random Forests

Let's evaluate using decision trees / random forests.

Can't see the forest for the trees, so let's start with a single decision tree classifier with a max depth of 4

Let's try to visualise the tree structure

Similar to the issues Donny came across, I needed to convert the dot file to a PNG outside of this notebook, by running GraphViz's dot.exe directly in a command prompt: `dot decision_tree.dot -o decision_tree.png -Tpng`
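A sketch of the export step that produces that .dot file, using iris as a stand-in (with `out_file=None`, `export_graphviz` returns the dot source as a string instead of writing a file, which can be handy for debugging):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)

# Shallow tree, matching the max depth of 4 used above
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Dot-format description of the fitted tree
dot_data = export_graphviz(tree, out_file=None, filled=True, rounded=True)
print(dot_data[:50])
print("depth:", tree.get_depth())
```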

Full decision tree graph

decision_tree.png

Section of decision tree graph showing root and some internal nodes

decision_tree_section.png

Section of decision tree graph showing some internal and leaf nodes

decision_tree_section2.png

We can see that as the max depth increases, the training and validation scores increase, reaching an optimum around a tree depth of 18.

Let's create the decision tree again, and this time not specify the max depth. We will then inspect the actual depth.

A random forest consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction. Let's switch over to a random forest classifier.

The criterion parameter of the RandomForestClassifier determines the function used to measure the quality of a split. The default criterion is "gini" for the Gini impurity; the other option is "entropy" for the information gain. Let's configure a random forest classifier using the entropy criterion.
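A sketch comparing the two criteria side by side with cross-validation, again on iris as a small stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

scores = {}
for criterion in ("gini", "entropy"):
    rf = RandomForestClassifier(n_estimators=100, criterion=criterion,
                                random_state=42)
    # Mean 5-fold cross-validated accuracy for each split criterion
    scores[criterion] = cross_val_score(rf, X, y, cv=5).mean()
    print(criterion, round(scores[criterion], 3))
```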

We observe that switching the random forest criterion from gini to entropy had little overall effect on model accuracy.

Max depth and the number of estimators are the two hyperparameters typically tuned in random forest classifiers, so let's see what effect increasing the estimator count has.

A slight improvement was observed, at the expense of training time. Let's use GridSearchCV to attempt to determine the optimal hyperparameters.
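A sketch of that grid search over max depth and estimator count (the grid values are illustrative; iris stands in for our dataset):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Small illustrative grid; None means nodes expand until leaves are pure
param_grid = {"max_depth": [4, 8, None], "n_estimators": [50, 100]}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```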